Headaches



Orchestrator Multi-Agent Clinical Decision Support System for Secondary Headache Diagnosis in Primary Care

Wu, Xizhi, Garduno-Rapp, Nelly Estefanie, Rousseau, Justin F, Thakkallapally, Mounika, Zhang, Hang, Ji, Yuelyu, Visweswaran, Shyam, Peng, Yifan, Wang, Yanshan

arXiv.org Artificial Intelligence

Unlike most primary headaches, secondary headaches need specialized care and can have devastating consequences if not treated promptly. Clinical guidelines highlight several 'red flag' features, such as thunderclap onset, meningismus, papilledema, focal neurologic deficits, signs of temporal arteritis, systemic illness, and the 'worst headache of their life' presentation. Despite these guidelines, determining which patients require urgent evaluation remains challenging in primary care settings. Clinicians often work with limited time, incomplete information, and diverse symptom presentations, which can lead to under-recognition and inappropriate care. We present a large language model (LLM)-based multi-agent clinical decision support system built on an orchestrator-specialist architecture, designed to perform explicit and interpretable secondary headache diagnosis from free-text clinical vignettes. The multi-agent system decomposes diagnosis into seven domain-specialized agents, each producing a structured and evidence-grounded rationale, while a central orchestrator performs task decomposition and coordinates agent routing. We evaluated the multi-agent system using 90 expert-validated secondary headache cases and compared its performance with a single-LLM baseline across two prompting strategies: question-based prompting (QPrompt) and clinical practice guideline-based prompting (GPrompt). We tested five open-source LLMs (Qwen-30B, GPT-OSS-20B, Qwen-14B, Qwen-8B, and Llama-3.1-8B), and found that the orchestrated multi-agent system with GPrompt consistently achieved the highest F1 scores, with larger gains in smaller models. These findings demonstrate that structured multi-agent reasoning improves accuracy beyond prompt engineering alone and offers a transparent, clinically aligned approach for explainable decision support in secondary headache diagnosis.
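
To make the orchestrator-specialist idea concrete, here is a minimal sketch of the fan-out pattern the abstract describes. The domain names, prompts, and `call_llm()` helper are illustrative assumptions, not the paper's actual seven-agent implementation.

```python
# Minimal sketch of an orchestrator-specialist loop for red-flag screening.
# Agent domains, prompts, and call_llm() are illustrative assumptions.

from dataclasses import dataclass

def call_llm(system_prompt: str, user_text: str) -> str:
    """Placeholder for a call to one of the open-source LLMs (e.g., Qwen-8B)."""
    raise NotImplementedError

@dataclass
class AgentResult:
    domain: str
    rationale: str  # structured, evidence-grounded rationale

# Three of the seven red-flag domains, for illustration only.
SPECIALISTS = {
    "onset": "Assess for thunderclap onset. Quote supporting text from the vignette.",
    "neuro": "Assess for focal neurologic deficits or papilledema. Quote supporting text.",
    "systemic": "Assess for fever, systemic illness, or signs of temporal arteritis.",
}

def orchestrate(vignette: str) -> list[AgentResult]:
    """The orchestrator decomposes the case and routes it to each specialist.
    Routing is a simple fan-out here; per-domain rationales are collected
    so the final diagnosis stays explicit and interpretable."""
    return [
        AgentResult(domain=d, rationale=call_llm(prompt, vignette))
        for d, prompt in SPECIALISTS.items()
    ]
```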


Statistical NLP for Optimization of Clinical Trial Success Prediction in Pharmaceutical R&D

Doane, Michael R.

arXiv.org Artificial Intelligence

This work presents the development and evaluation of an NLP-enabled probabilistic classifier designed to estimate the probability of technical and regulatory success (pTRS) for clinical trials in the field of neuroscience. While pharmaceutical R&D is plagued by high attrition rates and enormous costs, particularly within neuroscience, where success rates are below 10%, timely identification of promising programs can streamline resource allocation and reduce financial risk. Leveraging data from the ClinicalTrials.gov database and success labels from the recently developed Clinical Trial Outcome dataset, the classifier extracts text-based clinical trial features using statistical NLP techniques. These features were integrated into several non-LLM frameworks (logistic regression, gradient boosting, and random forest) to generate calibrated probability scores. Model performance was assessed on a retrospective dataset of 101,145 completed clinical trials spanning 1976-2024, achieving an overall ROC-AUC of 0.64. An LLM-based predictive model was then built using BioBERT, a domain-specific language representation encoder. The BioBERT-based model achieved an overall ROC-AUC of 0.74 and a Brier Score of 0.185, indicating its predictions had, on average, 40% less squared error than would be observed using industry benchmarks. The BioBERT-based model also made trial outcome predictions that were superior to benchmark values 70% of the time overall. By integrating NLP-driven insights into drug development decision-making, this work aims to enhance strategic planning and optimize investment allocation in neuroscience programs.
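
As a rough illustration of the non-LLM arm described above, the sketch below feeds text features into a calibrated classifier that outputs pTRS-style probabilities and reports ROC-AUC and Brier score. TF-IDF features and the `load_trials()` stub are assumptions; the paper's exact statistical NLP features may differ.

```python
# Sketch of a calibrated text-based trial-outcome classifier (non-LLM arm).
# TF-IDF features and the data loader are assumptions for illustration.

from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def load_trials() -> tuple[list[str], list[int]]:
    """Hypothetical loader: ClinicalTrials.gov free text plus 1/0 success
    labels from the Clinical Trial Outcome dataset. Replace with real I/O."""
    raise NotImplementedError

texts, labels = load_trials()
X_tr, X_te, y_tr, y_te = train_test_split(texts, labels, test_size=0.2, random_state=0)

model = make_pipeline(
    TfidfVectorizer(max_features=20_000, ngram_range=(1, 2)),
    CalibratedClassifierCV(GradientBoostingClassifier(), method="isotonic"),
)
model.fit(X_tr, y_tr)

p = model.predict_proba(X_te)[:, 1]  # calibrated probability of success
print("ROC-AUC:", roc_auc_score(y_te, p))
print("Brier score:", brier_score_loss(y_te, p))  # mean squared error of the probabilities
```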


Layer of Truth: Probing Belief Shifts under Continual Pre-Training Poisoning

Churina, Svetlana, Chebrolu, Niranjan, Jaidka, Kokil

arXiv.org Artificial Intelligence

Large language models (LLMs) continually evolve through pre-training on ever-expanding web data, but this adaptive process also exposes them to subtle forms of misinformation. While prior work has explored data poisoning during static pre-training, the effects of such manipulations under continual pre-training remain largely unexplored. Drawing inspiration from the illusory truth effect in human cognition - where repeated exposure to falsehoods increases belief in their accuracy - we ask whether LLMs exhibit a similar vulnerability. We investigate whether repeated exposure to false but confidently stated facts can shift a model's internal representation away from the truth. We introduce Layer of Truth, a framework and dataset for probing belief dynamics in continually trained LLMs. By injecting controlled amounts of poisoned data and probing intermediate representations across checkpoints, model scales, and question types, we quantify when and how factual beliefs shift. Our findings reveal that even minimal exposure can induce persistent representational drift in well-established facts, with susceptibility varying across layers and model sizes. These results highlight an overlooked vulnerability of continually updated LLMs: their capacity to internalize misinformation analogously to humans, underscoring the need for robust monitoring of factual integrity during model updates.
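
The probing step can be pictured with a simple linear probe over intermediate hidden states, re-applied at each continual-pretraining checkpoint. The sketch below uses GPT-2 as a stand-in model with a toy statement pair; the model, layer choice, and statements are illustrative assumptions, not the paper's setup.

```python
# Sketch of a layer-wise truth probe on hidden states. The model name,
# layer index, and statement set are illustrative assumptions.

import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in for a continually trained checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

statements = ["The Eiffel Tower is in Paris.", "The Eiffel Tower is in Rome."]
labels = [1, 0]  # 1 = true, 0 = false

def layer_repr(text: str, layer: int) -> torch.Tensor:
    """Mean-pooled hidden state of one intermediate layer for a statement."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[layer].mean(dim=1).squeeze(0)

layer = 6  # probe one intermediate layer; sweep layers and checkpoints in practice
X = torch.stack([layer_repr(s, layer) for s in statements]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
# Re-fitting (or re-applying) the probe at each checkpoint quantifies when the
# representation of a fact drifts toward the repeated falsehood.
```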



Chatbait Is Taking Over the Internet

The Atlantic - Technology

Hours deep into a recent migraine, I turned to ChatGPT for help. "How do I get my headache to stop?" The bot suggested that I drink water and pop a Tylenol--both of which I had already tried, and neither of which had helped. ChatGPT then made a tantalizing offer: "If you want, I can give a quick 5-minute routine right now to stop a headache." No fear, the chatbot had a new plan: "If you want, I can give a '2-minute micro version' that works even if your headache is severe," the bot volunteered. Lately, chatbots seem to be using more sophisticated tactics to keep people talking. In some cases, like my request for headache tips, bots end their messages with prodding follow-up questions. In others, they proactively message users to coax them into conversation: After I clicked through the profiles of 20 AI bots on Instagram, all of them DM'ed me first. Days later, my phone pinged: "bestie" wanted to chat. Maybe this approach to engagement sounds familiar. Clickbait is already everywhere online--whether it's sensationalist headlines ("The Shocking Fact About American History That 95 Percent of Harvard Graduates Get Wrong") or exaggerated video thumbnails (see: "YouTube face"). Chatbots are now headed in a similar direction. As AI takes over the web, clickbait is giving way to chatbait. Some bots appear to be more guilty of chatbait than others. When I ditched ChatGPT and asked Google's Gemini for headache help, it offered a long list of advice, then paused without asking any follow-ups. Anthropic's Claude wanted to know whether my headache was tension-related, due to sinus pressure, or something else entirely--hardly a goading question. That's not to say that these other bots never respond with chatbait. Chatbots tend to be sycophantic: They often flatter and sweet-talk users in a way that encourages people to keep talking. But, in my experience, ChatGPT goes a step further, stringing users along with unrequited offers and provocative questions. When I told the chatbot I was thinking of getting a dog, it offered to make a "Dog Match Quiz" to help decide the perfect breed. Later, when I complimented ChatGPT's emoji use, it volunteered to make me "a single 'signature combo.'" How could I decline that?


Exploring Gender Differences in Chronic Pain Discussions on Reddit

Andrade, Ancita Maria, Banerjee, Tanvi, Mundugar, Ramakrishna

arXiv.org Artificial Intelligence

Pain is an inherent part of human existence, manifesting as both physical and emotional experiences, and can be categorized as either acute or chronic. Over the years, extensive research has been conducted to understand the causes of pain and explore potential treatments, with contributions from various scientific disciplines. However, earlier studies often overlooked the role of gender in pain experiences. In this study, we utilized Natural Language Processing (NLP) to analyze and gain deeper insights into individuals' pain experiences, with a particular focus on gender differences. We successfully classified posts into male and female corpora using the Hidden Attribute Model-Convolutional Neural Network (HAM-CNN), achieving an F1 score of 0.86 by aggregating posts based on usernames. Our analysis revealed linguistic differences between genders, with female posts tending to be more emotionally focused. Additionally, the study highlighted that conditions such as migraine and sinusitis are more prevalent among females and explored how pain medication affects individuals differently based on gender.
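
The username-level aggregation is the key evaluation detail here. Below is a minimal sketch of one plausible reading of it: per-post predictions pooled by majority vote into one label per user before scoring F1. The HAM-CNN classifier is stubbed out, and the vote-based pooling and data structures are assumptions for illustration.

```python
# Sketch of username-level aggregation of per-post gender predictions.
# The per-post model is stubbed; majority-vote pooling is an assumption.

from collections import defaultdict
from statistics import mode
from sklearn.metrics import f1_score

# (username, post_text) pairs and per-user gold labels, loaded elsewhere
posts = [("user_a", "..."), ("user_a", "..."), ("user_b", "...")]
gold = {"user_a": "female", "user_b": "male"}

def predict_post(text: str) -> str:
    """Stand-in for the per-post HAM-CNN classifier."""
    return "female"  # placeholder prediction

by_user = defaultdict(list)
for user, text in posts:
    by_user[user].append(predict_post(text))

# Majority vote over a user's posts yields one label per username.
pred = {user: mode(votes) for user, votes in by_user.items()}
users = sorted(gold)
print(f1_score([gold[u] for u in users], [pred[u] for u in users], pos_label="female"))
```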


Cryptogenic stroke and migraine: using probabilistic independence and machine learning to uncover latent sources of disease from the electronic health record

Betts, Joshua W., Still, John M., Lasko, Thomas A.

arXiv.org Artificial Intelligence

Migraine is a common but complex neurological disorder that doubles the lifetime risk of cryptogenic stroke (CS). However, this relationship remains poorly characterized, and few clinical guidelines exist to reduce this associated risk. We therefore propose a data-driven approach to extract probabilistically independent sources from electronic health record (EHR) data and create a 10-year risk-predictive model for CS in migraine patients. These sources represent external latent variables acting on the causal graph constructed from the EHR data and approximate root causes of CS in our population. A random forest model trained on patient expressions of these sources demonstrated good discrimination (ROC-AUC 0.771) and identified the top 10 most predictive sources of CS in migraine patients. These sources revealed that pharmacologic interventions were the most important factor in minimizing CS risk in our population and identified a factor related to allergic rhinitis as a potential causative source of CS in migraine patients.
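
The two-stage design (independent-source extraction, then supervised risk prediction on the source loadings) can be sketched as below. ICA is assumed here as the probabilistic-independence decomposition, and the data are synthetic; the paper's actual decomposition and features may differ.

```python
# Sketch of the two-stage approach: extract probabilistically independent
# sources from an EHR feature matrix, then predict 10-year CS risk from
# patient loadings. ICA and the synthetic data are assumptions.

import numpy as np
from sklearn.decomposition import FastICA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_ehr = rng.random((500, 200))    # patients x EHR-derived features (synthetic)
y = rng.integers(0, 2, size=500)  # 1 = cryptogenic stroke within 10 years

# Stage 1: latent sources approximating external root causes.
sources = FastICA(n_components=50, random_state=0).fit_transform(X_ehr)

# Stage 2: risk model over each patient's expression of the sources.
X_tr, X_te, y_tr, y_te = train_test_split(sources, y, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

print("AUROC:", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
# Feature importances rank the latent sources most predictive of CS risk,
# analogous to the top-10 sources reported above.
print(np.argsort(rf.feature_importances_)[::-1][:10])
```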


Clinical knowledge in LLMs does not translate to human interactions

Bean, Andrew M., Payne, Rebecca, Parsons, Guy, Kirk, Hannah Rose, Ciro, Juan, Mosquera, Rafael, Monsalve, Sara Hincapié, Ekanayaka, Aruna S., Tarassenko, Lionel, Rocher, Luc, Mahdi, Adam

arXiv.org Artificial Intelligence

Global healthcare providers are exploring use of large language models (LLMs) to provide medical advice to the public. LLMs now achieve nearly perfect scores on medical licensing exams, but this does not necessarily translate to accurate performance in real-world settings. We tested whether LLMs can assist members of the public in identifying underlying conditions and choosing a course of action (disposition) in ten medical scenarios in a controlled study with 1,298 participants. Participants were randomly assigned to receive assistance from an LLM (GPT-4o, Llama 3, Command R+) or a source of their choice (control). Tested alone, LLMs complete the scenarios accurately, correctly identifying conditions in 94.9% of cases and disposition in 56.3% on average. However, participants using the same LLMs identified relevant conditions in less than 34.5% of cases and disposition in less than 44.2%, both no better than the control group. We identify user interactions as a challenge to the deployment of LLMs for medical advice. Standard benchmarks for medical knowledge and simulated patient interactions do not predict the failures we find with human participants. Moving forward, we recommend systematic human user testing to evaluate interactive capabilities prior to public deployments in healthcare.


AutoMedPrompt: A New Framework for Optimizing LLM Medical Prompts Using Textual Gradients

Wu, Sean, Koo, Michael, Scalzo, Fabien, Kurtz, Ira

arXiv.org Artificial Intelligence

Large language models (LLMs) have demonstrated increasingly sophisticated performance in medical and other fields of knowledge. Traditional methods of creating specialist LLMs require extensive fine-tuning and training of models on large datasets. Recently, prompt engineering, instead of fine-tuning, has shown potential to boost the performance of general foundation models. However, prompting methods such as chain-of-thought (CoT) may not be suitable for every subspecialty, and k-shot approaches may introduce irrelevant tokens into the context space. We present AutoMedPrompt, which explores the use of textual gradients to elicit medically relevant reasoning through system prompt optimization. AutoMedPrompt leverages TextGrad's automatic differentiation via text to improve the ability of general foundation LLMs. We evaluated AutoMedPrompt on Llama 3, an open-source LLM, using several QA benchmarks, including MedQA, PubMedQA, and the nephrology subspecialty-specific NephSAP. Our results show that prompting with textual gradients outperforms previous methods on open-source LLMs and surpasses proprietary models such as GPT-4, Claude 3 Opus, and Med-PaLM 2. AutoMedPrompt sets a new state-of-the-art (SOTA) performance on PubMedQA with an accuracy of 82.6%, while also outperforming previous prompting strategies on open-source models for MedQA (77.7%) and NephSAP (63.8%).
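
To show what optimizing a system prompt with textual gradients looks like in practice, here is a sketch built on the open-source textgrad package that AutoMedPrompt leverages. The API names follow textgrad's public examples but may vary by version, and the engine strings, prompts, and question are illustrative assumptions rather than the paper's setup.

```python
# Sketch of system-prompt optimization with textual gradients via textgrad.
# Engine names, prompts, and the loss instruction are assumptions.

import textgrad as tg

tg.set_backward_engine("gpt-4o", override=True)  # engine producing textual gradients

# The system prompt is the trainable "parameter".
system_prompt = tg.Variable(
    "You are a careful medical assistant. Answer the question concisely.",
    requires_grad=True,
    role_description="system prompt guiding medical QA",
)
model = tg.BlackboxLLM("gpt-4o", system_prompt=system_prompt)
optimizer = tg.TGD(parameters=[system_prompt])

question = tg.Variable(
    "Does aspirin reduce migraine severity? Answer yes/no/maybe.",
    requires_grad=False,
    role_description="medical question",
)
answer = model(question)

# Natural-language loss: a critique of the answer acts as the gradient signal.
loss_fn = tg.TextLoss("Evaluate whether the answer is medically accurate and well-justified.")
loss = loss_fn(answer)
loss.backward()
optimizer.step()  # rewrites the system prompt based on the textual gradient
print(system_prompt.value)
```

In AutoMedPrompt's setting, a loop like this would be run over benchmark questions (e.g., from PubMedQA or NephSAP), with the critique grounded in the gold answers, so the system prompt accumulates subspecialty-relevant guidance instead of being hand-engineered.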